Fault-Tolerance Mechanisms in the SB-PRAM Multiprocessor
Abstract
The SB-PRAM is an experimental multiprocessor architecture with a shared address space and synchronously running threads, i.e. it gives the illusion of working on a PRAM. A 4-processor prototype has been completed, while a 64-processor prototype is under construction. We investigate the detection and handling of single-bit errors occurring during the transmission of packets in the interconnection network. We analyze the impact of an error on the different parts of a packet and derive several strategies to recover from such an error. The strategies range from single-bit correction codes to checkpointing the application and rolling back in case of an error. We find that the changes necessary in hardware and system software are small. In particular, none of the ASICs designed for the SB-PRAM have to be changed. The runtime overhead due to the fault-tolerance mechanisms is negligible. Finally, we sketch how these strategies can be extended to cover component failures.
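To make the notion of a single-bit correction code concrete, here is a minimal sketch in Python. The abstract does not say which code the SB-PRAM uses or how packets are laid out, so the choice of a Hamming(7,4) code and the function names `hamming74_encode` and `hamming74_decode` are assumptions made purely for illustration.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (an int in 0..15) into a 7-bit codeword.

    Illustrative assumption: a textbook Hamming(7,4) code, not the
    SB-PRAM's actual packet encoding.
    """
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # positions 1..7 are used, index 0 is ignored
    # Data bits sit at the non-power-of-two positions.
    code[3], code[5], code[6], code[7] = data
    # Each parity bit covers the positions whose index has that bit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def hamming74_decode(bits):
    """Return (nibble, syndrome) for a 7-bit codeword.

    The syndrome is 0 for an intact codeword; otherwise it is the
    1-based position of the single flipped bit, which is corrected.
    """
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1  # flip the faulty bit back
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, syndrome


if __name__ == "__main__":
    word = hamming74_encode(0b1011)
    word[4] ^= 1  # simulate a single-bit transmission error
    print(hamming74_decode(word))
```

A code of this kind corrects the error at the receiver without retransmission. The checkpoint-and-rollback strategy named in the abstract sits at the other end of the range and repeats work instead of repairing the packet.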
Similar resources
Controlling Memory Access Concurrency in Efficient Fault-Tolerant Parallel Algorithms
The CRCW PRAM under dynamic fail-stop (no restart) processor behavior is a fault-prone multiprocessor model for which it is possible to both guarantee reliability and preserve efficiency. To handle dynamic faults, some redundancy is necessary in the form of many processors concurrently performing a common read or write task. In this paper we show how to significantly decrease this concurrency by bo...
Hardware-Supported Fault Tolerance for Multiprocessors
For a computing system to be dependable, fault tolerance mechanisms have to be included. Massive parallelism in particular represents a new challenge for fault tolerance. In this paper we discuss basic hardware fault tolerance measures for massively parallel multiprocessors and solutions realized for and integrated into different multiprocessor architectures. Further we present our validatio...
Analysis of Selective Fault-Tolerant, Hard Real-Time
An increasing number of applications are demanding real-time performance from their multiprocessor systems. For many of these applications, a failure may produce disastrous results. Such failures are avoided in hard real-time systems by the use of fault-tolerance. In hard real-time multiprocessor scheduling, this fault tolerance may be provided by including several task backups in each schedule...
Fault Tolerance for Multiprocessor Systems Via Time Redundant Task Scheduling
Fault tolerance is often considered as a good additional feature for multiprocessor systems but nowadays it is becoming an essential attribute. Fault tolerance can be achieved by the use of dedicated customized hardware that may have the disadvantage of large cost. Another approach to fault tolerance is to exploit existing redundancy in multiprocessor systems via a task scheduling software stra...
Adaptable Fault Tolerance Configurations for Multiprocessor Systems
The escalating complexity of multiprocessor systems increases the probability of faults occurring in these systems. As a consequence, there is a great need for achieving fault tolerance of processing in multiprocessor systems. Fault tolerance generally requires some form of hardware and/or time redundancy. Two fault-tolerant configurations are proposed for both single and double t...